Who am I?


- Profile - 

> 31 years of experience in the Information Technology Industry, including thirteen years of experience 
working for leading IT consulting firms such as Computer Sciences Corporation 

> PhD in Computer Science from University of Colorado at Boulder 

> Past CEO and CTO

> Held senior management and technical leadership roles in many large IT Strategy and Modernization 
projects for Fortune 500 corporations in the insurance, banking, investment banking, pharmaceutical, retail,
and information management industries 

> Contributed to several high-profile ARPA and NSF research projects 

> Played an active role as a member of the OMG, ODMG, and X3H2 standards committees and as a 
Professor of Computer Science at Columbia initially and New York University since 1997 

> Proven record of delivering business solutions on time and on budget 

> Original designer and developer of jcrew.com and the suite of products now known as IBM InfoSphere 
DataStage 

> Creator of the Enterprise Architecture Management Framework (EAMF) and main contributor to the creation
of various maturity assessment methodologies

> Developed partnerships between several companies and New York University to incubate new
methodologies (e.g., EA maturity assessment methodology developed in Fall 2008), develop proof of
concept software, recruit skilled graduates, and increase the companies' visibility
How to reach me?




Cell (212) 203-5004 
Email jcf@cs.nyu.edu 


AIM, Y! IM, ICQ. jcf2_2003 


MSN IM jcf2_2003@yahoo.com 

LinkedIn http://www.linkedin.com/in/jcfranchitti 
Twitter http://twitter.com/jcfranchitti

Skype jcf2_2003@yahoo.com 


What is the course about? 


"Course description and syllabus: 


» http://www.nyu.edu/classes/jcf/CSCI-GA.2110-001_su14
» http://cs.nyu.edu/courses/summer14/G22.2110-001/index.html


= Textbook:

» Programming Language Pragmatics (3rd Edition)
Michael L. Scott
Morgan Kaufmann
ISBN-10: 0-12374-514-4, ISBN-13: 978-0-12374-514-4, (04/06/09)


Course goals 


= Intellectual: 
» help you understand benefits/pitfalls of different
approaches to language design, and how they 
work 


= Practical:


» you may need to design languages in your 
career (at least small ones) 

» understanding how to use a programming 
paradigm can improve your programming even 
in languages that don’t support it 

» Knowing how a feature is implemented helps 
understand time/space complexity 


Icons / Metaphors 


Common Realization


Knowledge/Competency Pattern


Governance


Alignment 


Solution Approach 
Introduction to Programming Languages - Sub-Topics


= Introduction

= Programming Language Design and Usage Main Themes
= Programming Language as a Tool for Thought

= Idioms

= Why Study Programming Languages

= Classifying Programming Languages 

= Imperative Languages 

= PL Genealogy 

= Predictable Performance vs. Writeability 

= Common Ideas 

= Development Environment & Language Libraries 
= Compilation vs. Interpretation 

= Programming Environment Tools 

= An Overview of Compilation 

= Abstract Syntax Tree 

= Scannerless Parsing 


Introduction (1/3)


» Why are there so many programming 
languages? 
» evolution -- we've learned better ways of 
doing things over time 


» socio-economic factors: proprietary interests, 
commercial advantage 


» orientation toward special purposes 
» orientation toward special hardware 
» diverse ideas about what is pleasant to use 


Introduction (2/3)


= What makes a language successful? 
» easy to learn (BASIC, Pascal, LOGO, Scheme) 


» easy to express things, easy to use once fluent,
“powerful” (C, Common Lisp, APL, Algol-68, Perl) 


» easy to implement (BASIC, Forth) 

» possible to compile to very good (fast/small) code 
(Fortran) 

» backing of a powerful sponsor (COBOL, PL/1, Ada, 
Visual Basic) 

» wide dissemination at minimal cost (Pascal, Turing, 
Java) 


Introduction (3/3) 


= Why do we have programming
languages? What is a language for?

» way of thinking -- way of expressing
algorithms

» languages from the user's point of view

» abstraction of virtual machine -- way of
specifying what you want the hardware to do
without getting down into the bits

» languages from the implementor's point of
view


Programming Language Design and Usage Main Themes (1/2) 


= Model of Computation (i.e., paradigm) 
= Expressiveness 


» Control structures 

» Abstraction mechanisms 

» Types and related operations 

» Tools for programming in the large 


= Ease of use 


» Writeability 

» Readability 

» Maintainability 

» Compactness — writeability/expressibility 
» Familiarity of Model 

» Less Error-Prone 

» Portability 

» Hides Details — simpler model 

» Early detection of errors 

» Modularity - Reuse, Composability, Isolation 
» Performance Transparency 

»  Optimizability 


= Note Orthogonal Implementation Issues: 


» Compile time: parsing, type analysis, static checking 
» Run time: parameter passing, garbage collection, method dispatching, remote invocation,
just-in-time compiling, parallelization, etc.


Programming Language Design and Usage Main Themes (2/2) 


= Classical Issues in Language Design: 


» Dijkstra, “Go To Statement Considered Harmful”,
¢ http://www.acm.org/classics/oct95/#WIRTH66
» Backus, “Can Programming Be Liberated from the
von Neumann Style?”
¢ http://www.stanford.edu/class/cs242/readings/backus.pdf
» Hoare, “An Axiomatic Basis for Computer
Programming”,
¢ http://www.spatial.maine.edu/~worboys/processes/hoare%20axiomatic.pdf
» Hoare, “The Emperor's Old Clothes”,
¢ http://www.braithwaite-lee.com/opinions/p75-hoare.pdf
» Parnas, “On the Criteria To Be Used in Decomposing
Systems into Modules”,
¢ http://www.acm.org/classics/may96/


Programming Language as a Tool for Thought 


"= Roles of programming language as a 
communication vehicle among programmers is
more important than writeability

= All general-purpose languages are Turing 
Complete (i.e., they can all compute the same 
things) 

= Some languages, however, can make the 
representation of certain algorithms 
cumbersome 

= Idioms in a language may be useful inspiration 
when using another language 


Idioms 


= Copying a string q to p in C:
» while (*p++ = *q++) ;
= Removing duplicates from the list @xs in Perl:
» my %seen = ();
@xs = grep { !$seen{$_}++ } @xs;
= Computing the sum of numbers in list xs in
Haskell:
» foldr (+) 0 xs


Is this natural? ... It is if you're used to it! 
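
To illustrate how an idiom from one language can inspire code in another, the three idioms above can be transliterated into Python. This is only a sketch: the function names remove_duplicates, fold_sum, and copy_bytes are made up for illustration, and Python has shorter built-in ways to do each of these.

```python
from functools import reduce
import operator


def remove_duplicates(xs):
    """Keep the first occurrence of each item, like the Perl grep/%seen idiom."""
    seen = set()
    result = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            result.append(x)
    return result


def fold_sum(xs):
    """Sum a list with a fold, in the spirit of Haskell's foldr (+) 0 xs.

    reduce folds from the left; for integer addition the result is the same.
    """
    return reduce(operator.add, xs, 0)


def copy_bytes(q):
    """Element-by-element copy into a fresh buffer, echoing the C loop."""
    p = bytearray(len(q))
    for i, byte in enumerate(q):
        p[i] = byte
    return bytes(p)
```

An experienced Python programmer would more likely write list(dict.fromkeys(xs)), sum(xs), and bytes(q), which is the point of the slide: what reads as natural depends on the idioms one already knows.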


Why Study Programming Languages? (1/6) 


= Help you choose a language. 

» C vs. Modula-3 vs. C++ for systems 
programming 

» Fortran vs. APL vs. Ada for numerical 
computations 

» Ada vs. Modula-2 for embedded systems 

» Common Lisp vs. Scheme vs. ML for 
symbolic data manipulation 

» Java vs. C/CORBA for networked PC 
programs 


Why Study Programming Languages? (2/6) 


= Make it easier to learn new languages 
» some languages are similar; easy to walk
down family tree

» concepts have even more similarity; if you
think in terms of iteration, recursion,
abstraction (for example), you will find it
easier to assimilate the syntax and semantic
details of a new language than if you try to
pick it up in a vacuum

¢ Think of an analogy to human languages: good
grasp of grammar makes it easier to pick up new
languages (at least Indo-European).


Why Study Programming Languages? (3/6) 


" Help you make better use of whatever 
language you use 


» understand obscure features: 
¢ In C, help you understand unions, arrays & 
pointers, separate compilation, varargs, catch and 
throw 
¢ In Common Lisp, help you understand first-class 
functions/closures, streams, catch and throw, 
symbol internals 


Why Study Programming Languages? (4/6) 


" Help you make better use of whatever 
language you use (cont.) 


» understand implementation costs: choose 
between alternative ways of doing things, 
based on knowledge of what will be done 
underneath: 


- use simple arithmetic equivalents (use x*x instead of x**2)
- use C pointers or Pascal "with" statement to factor
address calculations
» http://www.freepascal.org/docs-html/ref/refsu51.html
- avoid call by value with large data items in Pascal
- avoid the use of call by name in Algol 60

- choose between computation and table lookup (e.g. for
cardinality operator in C or C++)
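
The computation versus table lookup trade-off can be sketched for the cardinality (number of set bits) of an integer used as a bit set. This is a minimal illustration in Python rather than C, assuming non-negative integers; the names popcount_loop, popcount_table, and BYTE_COUNTS are hypothetical.

```python
def popcount_loop(x):
    """Count set bits by computation: clear the lowest set bit until none remain."""
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        count += 1
    return count


# Precomputed table: number of set bits in every possible byte value.
BYTE_COUNTS = [popcount_loop(i) for i in range(256)]


def popcount_table(x):
    """Count set bits by table lookup, one byte at a time."""
    count = 0
    while x:
        count += BYTE_COUNTS[x & 0xFF]
        x >>= 8
    return count
```

The loop version does work proportional to the number of set bits, while the table version spends 256 entries of memory to do one lookup per byte. Which one is faster depends on the machine and on how the language implements each operation, which is exactly why knowing what happens underneath matters.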


Why Study Programming Languages? (5/6) 


" Help you make better use of whatever 
language you use (cont.) 


» figure out how to do things in languages that 
don't support them explicitly: 

¢ lack of suitable control structures in Fortran: use
comments and programmer discipline for control
structures

¢ lack of recursion in Fortran, CSP, etc.: write a
recursive algorithm, then use mechanical recursion
elimination (even for things that aren't quite tail
recursive)
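
Mechanical recursion elimination can be sketched by replacing the call stack with an explicit one. The example below is an illustration only: it sums the integers in a nested list, first recursively and then with a hand-managed stack, and the function names are hypothetical.

```python
def total_recursive(tree):
    """Sum all integers in a nested list using ordinary recursion."""
    if isinstance(tree, int):
        return tree
    return sum(total_recursive(child) for child in tree)


def total_iterative(tree):
    """Same result without recursion: pending work lives on an explicit stack."""
    stack = [tree]
    total = 0
    while stack:
        node = stack.pop()
        if isinstance(node, int):
            total += node
        else:
            # Instead of a recursive call per child, push the children.
            stack.extend(node)
    return total
```

The recursive version is not tail recursive (each call still has to add up its children's results), yet the transformation is routine: every pending recursive call becomes an entry on the stack.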


Why Study Programming Languages? (6/6) 


" Help you make better use of whatever 
language you use (cont.) 


» figure out how to do things in languages that 
don't support them explicitly: 

- lack of named constants and enumerations in Fortran:
use variables that are initialized once, then never
changed

- lack of modules in C and Pascal: use comments and
programmer discipline

- lack of iterators in just about everything: fake them with
(member?) functions


Classifying Programming Languages (1/2) 


= Group languages by programming paradigms: 


» imperative

¢ von Neumann (Fortran, Pascal, Basic, C, Ada)
- programs have mutable storage (state) modified by assignments
- the most common and familiar paradigm

¢ object-oriented (Simula 67, Smalltalk, Eiffel, Ada95, Java, C#)
- data structures and their operations are bundled together
- inheritance

¢ scripting languages (Perl, Python, JavaScript, PHP)

» declarative

¢ functional (applicative) (Scheme, ML, pure Lisp, FP, Haskell)
- functions are first-class objects / based on lambda calculus
- side effects (e.g., assignments) discouraged

¢ logic, constraint-based (Prolog, VisiCalc, RPG, Mercury)
- programs are sets of assertions and rules

¢ functional + logic (Curry)

» hybrids

¢ imperative + object-oriented (C++)
¢ functional + object-oriented (O’Caml, O’Haskell)
¢ scripting (used to glue programs together) (Unix shells, Perl, Python, Tcl, PHP, JavaScript)
Classifying Programming Languages (2/2)


= Compared to machine or assembly language, all others are high-level 
= But within high-level languages, there are different levels as well 


= Somewhat confusingly, these are also referred to as low-level and high-level

» Low-level languages give the programmer more control (at the cost of requiring
more effort) over how the program is translated into machine code.
¢ C, FORTRAN
» High-level languages hide many implementation details, often with some
performance cost
¢ BASIC, LISP, SCHEME, ML, PROLOG
» Wide-spectrum languages try to do both:
¢ ADA, C++, (JAVA)
» High-level languages typically have garbage collection and are often
interpreted.
» The higher the level, the harder it is to predict performance (bad for real-time or
performance-critical applications)

= Note other “types/flavors” of languages: fourth generation (SETL, SQL),
concurrent/distributed (Concurrent Pascal, Hermes), markup, special purpose (report
writing), graphical, etc.


Imperative Languages 


= Imperative languages, particularly the von 
Neumann languages, predominate 
» They will occupy the bulk of our attention 

" We also plan to spend a lot of time on 
functional and logic languages


PL Genealogy


= FORTRAN (1957) => Fortran90, HPF
= COBOL (1959) => COBOL 2000
» still a large chunk of installed software
= Algol60 => Algol68 => Pascal => Ada
= Algol60 => BCPL => C => C++
= APL => J
= Snobol => Icon
= Simula => Smalltalk
= Lisp => Scheme => ML => Haskell


with lots of cross-pollination:
e.g., Java is influenced by C++, Smalltalk, Lisp, Ada, etc.


Predictable Performance vs. Writeability 


= Low-level languages mirror the physical 
machine: 

» Assembly, C, Fortran 

= High-level languages model an abstract 
machine with useful capabilities: 

» ML, Setl, Prolog, SQL, Haskell 
= Wide-spectrum languages try to do both: 
» Ada, C++, Java, C# 

" High-level languages have garbage collection, 
are often interpreted, and are generally poorly suited to
hard real-time programming.

» The higher the level, the harder it is to determine cost 
of operations. 


Common Ideas @ 


= Modern imperative languages (e.g., Ada, C++, 
Java) have similar characteristics: 


» large number of features (grammar with several
hundred productions, 500-page reference manuals, ...)
» a complex type system 
» procedural mechanisms 
» object-oriented facilities 
» abstraction mechanisms, with information hiding 
» several storage-allocation mechanisms 
» facilities for concurrent programming (not in C++ before C++11)
» facilities for generic programming (added to Java in Java 5)


Language Mechanism & Patterns 


= Design Patterns: Gamma, Johnson, Helm, Vlissides 
» Bits of design that work to solve sub-problems 
» What is mechanism in one language is pattern in another 
¢ Mechanism: C++ class 
¢ Pattern: C struct with array of function pointers 
¢ Exactly how early C++ compilers worked 
= Why use patterns
» Start from very simple language, very simple semantics 
» Compare mechanisms of other languages by building patterns 
in simpler language 
» Enable meaningful comparisons between language 
mechanisms 
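
The idea that a mechanism in one language is a pattern in another can be sketched by building method dispatch out of plain records and function tables, much like the C struct with function pointers mentioned above. The sketch below uses Python dictionaries and deliberately avoids the class mechanism; all names (make_circle, make_square, send, and the method tables) are hypothetical.

```python
import math


def circle_area(self):
    return math.pi * self["radius"] ** 2


def square_area(self):
    return self["side"] ** 2


# One function table per "class", shared by all of its instances.
CIRCLE_METHODS = {"area": circle_area, "name": lambda self: "circle"}
SQUARE_METHODS = {"area": square_area, "name": lambda self: "square"}


def make_circle(radius):
    """Build a record whose first field points at its method table."""
    return {"methods": CIRCLE_METHODS, "radius": radius}


def make_square(side):
    return {"methods": SQUARE_METHODS, "side": side}


def send(obj, method_name, *args):
    """Look the method up in the object's table and call it on the object."""
    return obj["methods"][method_name](obj, *args)


shapes = [make_circle(1.0), make_square(3)]
names = [send(shape, "name") for shape in shapes]
```

Building the pattern by hand makes visible what the class mechanism supplies for free: a shared table per class, an implicit self argument, and lookup at the call site.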
Development Environment & Language Libraries (1/2)


= Development Environment 


» Interactive Development Environments 
¢ Smalltalk browser environment 
¢ Microsoft IDE 

» Development Frameworks
¢ Swing, MFC 

» Language aware Editors 


Development Environment & Language Libraries (2/2) 


= The programming environment may be larger 
than the language. 
» The predefined libraries are indispensable to the 
proper use of the language, and its popularity 
» Libraries change much more quickly than the 
language 
» Libraries usually very different for different languages 
» The libraries are defined in the language itself, but 
they have to be internalized by a good programmer 
» Examples: 
¢ Java Swing classes 
¢ Ada I/O packages 
¢ C++ Standard Template Library (STL) 


Compilation vs. Interpretation (1/16) 


= Compilation vs. interpretation 
» not opposites 
» not a clear-cut distinction 

= Pure Compilation 


» The compiler translates the high-level source 
program into an equivalent target program 
(typically in machine language), and then 
goes away: 


Source program --> Compiler --> Target program


Input --> Target program --> Output


Compilation vs. Interpretation (2/16) 


= Pure Interpretation 


» Interpreter stays around for the execution of 
the program 


» Interpreter is the locus of control during 
execution 


Source program --+
                 +--> Interpreter --> Output
Input -----------+


Compilation vs. Interpretation (3/16) 


= Interpretation: 
» Greater flexibility 
» Better diagnostics (error messages) 


= Compilation 
» Better performance 


Compilation vs. Interpretation (4/16) 


= Common case is compilation or simple 
pre-processing, followed by interpretation 
= Most language implementations include a
mixture of both compilation and
interpretation


Source program --> Translator --> Intermediate program


Intermediate program --+
                       +--> Virtual machine --> Output
Input -----------------+
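
The translate-then-interpret arrangement can be sketched with a toy expression language: a translator turns a nested expression into instructions for a stack machine, and a small virtual machine then executes them. The instruction names (PUSH, ADD, SUB, MUL) and the functions translate and run are assumptions made for this illustration, not any real system's format.

```python
import operator

# Operations supported by the toy virtual machine.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def translate(expr):
    """Translate a nested tuple such as ("ADD", 2, ("MUL", 3, 4))
    into a flat list of stack-machine instructions (the intermediate program)."""
    if isinstance(expr, (int, float)):
        return [("PUSH", expr)]
    op, left, right = expr
    return translate(left) + translate(right) + [(op, None)]


def run(code):
    """Virtual machine: interpret the intermediate program with a value stack."""
    stack = []
    for instr, arg in code:
        if instr == "PUSH":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[instr](left, right))
    return stack.pop()


program = translate(("ADD", 2, ("MUL", 3, 4)))
result = run(program)  # evaluates 2 + 3 * 4
```

The translator runs once and goes away; the virtual machine stays around during execution, which is why this design sits between pure compilation and pure interpretation.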


Compilation vs. Interpretation (5/16) 


Note that compilation does NOT have to produce 
machine language for some sort of hardware 
Compilation is translation from one language into 
another, with full analysis of the meaning of the 
input 

Compilation entails semantic understanding of 
what is being processed; pre-processing does 
not 

A pre-processor will often let errors through. A 
compiler hides further steps; a pre-processor 
does not 


Compilation vs. Interpretation (6/16) 


= Many compiled languages have 
interpreted pieces, e.g., formats in Fortran 
or C 

= Most use “virtual instructions” 
» set operations in Pascal 
» string manipulation in Basic 

= Some compilers produce nothing but 
virtual instructions, e.g., Pascal P-code, 
Java byte code, Microsoft COM+ 


Compilation vs. Interpretation (7/16) 


= Implementation strategies: 


» Preprocessor 

¢ Removes comments and white space 

¢ Groups characters into tokens (keywords, 
identifiers, numbers, symbols) 

¢ Expands abbreviations in the style of a macro 
assembler 

¢ Identifies higher-level syntactic structures (loops, 
subroutines) 


Compilation vs. Interpretation (8/16) 


= Implementation strategies: 


» Library of Routines and Linking 
¢ Compiler uses a linker program to merge the
appropriate library of subroutines (e.g., math
functions such as sin, cos, log, etc.) into the final
program:


Fortran program --> Compiler --> Incomplete machine language


Incomplete machine language --+
                              +--> Linker --> Machine language program
Library routines -------------+
Compilation vs. Interpretation (9/16)


= Implementation strategies: 


» Post-compilation Assembly 


¢ Facilitates debugging (assembly language 
easier for people to read) 

¢ Isolates the compiler from changes in the
format of machine language files (only the
assembler must be changed, and it is shared by
many compilers)


Source program --> Compiler --> Assembly language


Assembly language --> Assembler --> Machine language


Compilation vs. Interpretation (10/16) 


= Implementation strategies: 


» The C Preprocessor (conditional compilation) 


¢ Preprocessor deletes portions of code, which 
allows several versions of a program to be built 
from the same source 


Source program --> Preprocessor --> Modified source program


Modified source program --> Compiler --> Assembly language


Compilation vs. Interpretation (11/16) 


= Implementation strategies: 


» Source-to-Source Translation (C++) 


¢ C++ implementations based on the early AT&T 
compiler generated an intermediate program in C, 
instead of an assembly language: 


Source program --> Preprocessor --> Modified source program


Modified source program --> C++ compiler --> C code

C code --> C compiler --> Assembly language
Compilation vs. Interpretation (12/16)


= Implementation strategies: 
» Bootstrapping 


Step 1:
Pascal compiler, in Pascal, that generates machine language
--> translated by: Pascal compiler, in P-code, that generates P-code,
running on the P-code interpreter
--> Pascal compiler, in P-code, that generates machine language


Step 2:
Pascal compiler, in Pascal, that generates machine language
--> translated by: Pascal compiler, in P-code, that generates machine language,
running on the P-code interpreter
--> Pascal compiler, in machine language, that generates machine language
Compilation vs. Interpretation (13/16)


= Implementation strategies: 


» Compilation of Interpreted Languages 


¢ The compiler generates code that makes 
assumptions about decisions that won't be 
finalized until runtime. If these assumptions are 
valid, the code runs very fast. If not, a dynamic 
check will revert to the interpreter. 


Compilation vs. Interpretation (14/16) 


= Implementation strategies: 


» Dynamic and Just-in-Time Compilation


¢ In some cases a programming system may 
deliberately delay compilation until the last 
possible moment. 


- Lisp or Prolog invoke the compiler on the fly, to
translate newly created source into machine language,
or to optimize the code for a particular input set.


- The Java language definition defines a
machine-independent intermediate form known as byte code.
Byte code is the standard format for distribution of Java
programs.

- The main C# compiler produces .NET Common
Intermediate Language (CIL), which is then translated
into machine code immediately prior to execution.


Compilation vs. Interpretation (15/16) 


= Implementation strategies: 


» Microcode 
¢ Assembly-level instruction set is not implemented 
in hardware; it runs on an interpreter. 
¢ Interpreter is written in low-level instructions
(microcode or firmware), which are stored in
read-only memory and executed by the hardware.


Compilation vs. Interpretation (16/16) 


= Compilers exist for some interpreted languages, 
but they aren't pure: 


» selective compilation of compilable pieces and
extra-sophisticated pre-processing of remaining source.


» Interpretation of parts of code, at least, is still 
necessary for reasons above. 
= Unconventional compilers 
» text formatters 
» silicon compilers 
» query language processors 


Programming Environment Tools 


[Table: programming environment tools by type, with Unix examples (e.g., ctags, SCCS)]


An Overview of Compilation (1/15) 


= Phases of Compilation 


Front end:

Character stream
--> Scanner (lexical analysis)
--> Token stream
--> Parser (syntax analysis)
--> Parse tree
--> Semantic analysis and intermediate code generation
--> Abstract syntax tree or other intermediate form

Back end:

--> Machine-independent code improvement (optional)
--> Modified intermediate form
--> Target code generation
--> Target language
--> Machine-specific code improvement (optional)
--> Modified target language
An Overview of Compilation (2/15)


= Scanning: 

» divides the program into "tokens", which are 
the smallest meaningful units; this saves 
time, since character-by-character processing 
is slow 

» we can tune the scanner better if its job is
simple; it also saves complexity (lots of it) for
later stages

» you can design a parser to take characters 
instead of tokens as input, but it isn't pretty 

» scanning is recognition of a regular language, 
e.g., via Deterministic Finite Automata (DFA) 


An Overview of Compilation (3/15) 


" Parsing is recognition of a context-free 
language, e.g., via Push Down Automata 
(PDA) 

» Parsing discovers the "context free" structure 
of the program 


» Informally, it finds the structure you can 
describe with syntax diagrams (the "circles 
and arrows" in a Pascal manual) 


An Overview of Compilation (4/15) 


= Semantic analysis is the discovery of 

meaning in the program 

» The compiler actually does what is called 
STATIC semantic analysis. That's the 
meaning that can be figured out at compile 
time 

» Some things (e.g., array subscript out of 
bounds) can't be figured out until run time. 
Things like that are part of the program's 
DYNAMIC semantics 


An Overview of Compilation (5/15) 


= Intermediate form (IF) is generated after semantic
analysis (if the program passes all checks)


» IFs are often chosen for machine independence, 
ease of optimization, or compactness (these are 
somewhat contradictory) 


» They often resemble machine code for some 
imaginary idealized machine; e.g. a stack 
machine, or a machine with arbitrarily many 
registers 

» Many compilers actually move the code through 
more than one IF 


An Overview of Compilation (6/15) 


" Optimization takes an intermediate-code 
program and produces another one that 
does the same thing faster, or in less 
space

» The term is a misnomer; we just improve 
code 
» The optimization phase is optional 

= Code generation phase produces
assembly language or (sometimes)
relocatable machine language


An Overview of Compilation (7/15) 


= Certain machine-specific optimizations 
(use of special instructions or addressing 
modes, etc.) may be performed during or 
after target code generation 


"= Symbol table: all phases rely on a symbol 
table that keeps track of all the identifiers in 
the program and what the compiler knows 
about them 

» This symbol table may be retained (in some 
form) for use by a debugger, even after 
compilation has completed 


An Overview of Compilation (8/15) 


= Lexical and Syntax Analysis 
» GCD Program (in C) 


int main() {
    int i = getint(), j = getint();
    while (i != j) {
        if (i > j) i = i - j;
        else j = j - i;
    }
    putint(i);
}


An Overview of Compilation (9/15) 


= Lexical and Syntax Analysis 
» GCD Program Tokens 


¢ Scanning (lexical analysis) and parsing
recognize the structure of the program; scanning groups
characters into tokens, the smallest meaningful
units of the program


int main ( ) {
int i = getint ( ) , j = getint ( ) ;
while ( i != j ) {
if ( i > j ) i = i - j ;
else j = j - i ;
}
putint ( i ) ;
}
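
How a scanner might produce such a token stream can be sketched with a regular expression per token class. This is a simplified illustration, not a full C scanner: it only knows identifiers, integer literals, and the handful of symbols that the GCD program uses, and the names TOKEN_RE and tokenize are hypothetical.

```python
import re

# One alternative per token class; whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<ident>[A-Za-z_][A-Za-z_0-9]*)   # identifiers and keywords
  | (?P<number>[0-9]+)                  # integer literals
  | (?P<symbol>!=|[-+*/=<>(){},;])      # two-character != before single symbols
  | (?P<space>\s+)                      # whitespace, discarded below
""", re.VERBOSE)


def tokenize(source):
    """Split source text into a list of token strings, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "space":
            tokens.append(match.group())
        pos = match.end()
    return tokens


GCD_SOURCE = """
int main() {
    int i = getint(), j = getint();
    while (i != j) {
        if (i > j) i = i - j;
        else j = j - i;
    }
    putint(i);
}
"""

gcd_tokens = tokenize(GCD_SOURCE)
```

Each alternative is a regular expression, which is what lets a production scanner compile the whole token definition into a single deterministic finite automaton.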


An Overview of Compilation (10/15) 


= Lexical and Syntax Analysis 


» Context-Free Grammar and Parsing 


¢ Parsing organizes tokens into a parse tree that 
represents higher-level constructs in terms of their 
constituents 

¢ Potentially recursive rules known as context-free 
grammar define the ways in which these 
constituents combine 


An Overview of Compilation (11/15) 


= Context-Free Grammar and Parsing 
» Example (while loop in C) 


iteration-statement → while ( expression ) statement


statement, in turn, is often a list enclosed in braces:
statement → compound-statement
compound-statement → { block-item-list_opt }
where
block-item-list_opt → block-item-list
or
block-item-list_opt → ε
and
block-item-list → block-item
block-item-list → block-item-list block-item
block-item → declaration
block-item → statement
An Overview of Compilation (12/15)


= Context-Free Grammar and Parsing 
» GCD Program Parse Tree 


[Figure: parse tree for the GCD program. The root translation-unit derives a function-definition whose children are declaration-specifiers (type-specifier int), a declarator (ident main with an empty identifier-list_opt), declaration-list_opt (ε), and a compound-statement { block-item-list_opt }, whose block-item-list begins with a declaration.]
An Overview of Compilation (13/15)


= Context-Free Grammar and Parsing (cont.) 


[Figure: parse tree (continued) for the declaration int i = getint(), j = getint();. The declaration expands into declaration-specifiers (type-specifier int) and an init-declarator-list with two init-declarators; each pairs a declarator (ident i, ident j) with an initializer that is a postfix-expression calling ident getint with an empty argument-expression-list_opt.]
An Overview of Compilation (14/15)


[Figure: parse tree (continued). Subtree A is the iteration-statement while ( expression ) statement: its condition is the equality-expression i != j and its body is a compound-statement containing the selection-statement if ( expression ) statement else statement, with condition i > j and the two expression-statements i = i - j and j = j - i. Subtree B is the expression-statement that calls ident putint with argument ident i.]
An Overview of Compilation (15/15)


= Syntax Tree
» GCD Program Syntax Tree


[Figure: syntax tree for the GCD program. Nodes such as program, while, and call refer to symbol table entries by index.]


Index  Symbol  Type
1      void    type
2      int     type
3      getint  func : (1) → (2)
4      putint  func : (2) → (1)
5      i       (2)
6      j       (2)


Abstract Syntax Tree (1/2) 


= Many non-terminals inside a parse tree are artifacts of
the grammar


= Remember:
E ::= E + T | T
T ::= T * Id | Id
The parse tree for B * C can be written as
E(T(Id(B), Id(C)))
In contrast, an abstract syntax tree (AST) captures only
those tree nodes that are necessary for representing the
program
In the example:
T(Id(B), Id(C))
Consequently, many parsers really generate abstract
syntax trees.
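
The step from parse tree to AST can be sketched as a tree rewrite that drops nodes with a single child, since those only record a grammar rule such as E ::= T. The sketch below assumes trees are (label, children) tuples and leaves are (label, name) pairs; the function name to_ast is hypothetical, and real parsers typically build the AST directly rather than pruning a parse tree.

```python
def to_ast(node):
    """Collapse chains of single-child non-terminals in a parse tree."""
    label, children = node
    if isinstance(children, str):
        # Leaf such as ("Id", "B"): keep as is.
        return node
    pruned = [to_ast(child) for child in children]
    if len(pruned) == 1:
        # A node with one child adds no information: replace it by the child.
        return pruned[0]
    return (label, pruned)


# E(T(Id(B), Id(C))) from the slide, written as nested tuples.
parse_tree = ("E", [("T", [("Id", "B"), ("Id", "C")])])
ast = to_ast(parse_tree)
```

Here the outer E node disappears and the result corresponds to T(Id(B), Id(C)), matching the example above.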


Abstract Syntax Tree (2/2) 


= Another explanation for abstract syntax
tree: It's a tree capturing only semantically
relevant information for a program
» i.e., omitting all formatting and comments

= Question 1: What is a concrete syntax
tree?

= Question 2: When do I need a concrete
syntax tree?


Scannerless Parsing


Separating syntactic analysis into lexing and 
parsing helps performance. After all, regular 
expressions can be made very fast 


But it also limits language design choices. For
example, it's very hard to compose different
languages with separate lexers and parsers
(think of embedding SQL in Java)

Scannerless parsing integrates lexical analysis
into the parser, making this problem more
tractable.


Programming Language Syntax - Sub-Topics


= Language Definition 

= Syntax and Semantics 
= Grammars 

= The Chomsky Hierarchy 
= Regular Expressions 

= Regular Grammar Example 
= Lexical Issues 

= Context-Free Grammars 
= Scanning 

= Parsing 

= LL Parsing 

= LR Parsing 


Language Definition 


= Different users have different needs: 
» programmers: tutorials, reference manuals, 
programming guides (idioms) 
» implementors: precise operational
semantics 


» verifiers: rigorous axiomatic or natural 
semantics 


» language designers and lawyers: all of the 
above 


" Different levels of detail and precision 
» but none should be sloppy! 


Syntax and Semantics 


= Syntax refers to external representation: 
» Given some text, is it a well-formed program? 

= Semantics denotes meaning: 
» Given a well-formed program, what does it mean? 
» Often depends on context 


= The division is somewhat arbitrary 


» Note: 

¢ It is possible to fully describe the syntax and semantics of a programming
language by syntactic means (e.g., Algol68 and W-grammars), but this is
highly impractical

¢ Typically use a grammar for the context-free aspects, and different 
method for the rest 

» Similar looking constructs in different languages often have 
subtly (or not-so-subtly) different meanings 

» Good syntax, unclear semantics: “Colorless green ideas sleep 
furiously” 

» Good semantics, poor syntax: “Me go swimming now, sorry 
bye” 

» In programming languages: syntax tells you what a well-formed
program looks like. Semantics tells you the relationship of
output to input


Grammars (1/2) 


= A grammar G is a tuple (Σ, N, S, δ)
» N is the set of non-terminal symbols
» S is the distinguished non-terminal: the root symbol
» Σ is the set of terminal symbols (alphabet)
» δ is the set of rewrite rules (productions) of the form:
ABC... ::= XYZ...
where A, B, C, D, X, Y, Z are terminals and non-terminals


» The language is the set of sentences containing only 
terminal symbols that can be generated by applying 
the rewriting rules starting from the root symbol (let’s 
call such sentences strings) 


Grammars (2/2) 


= Consider the following grammar G:
» N = {S, X, Y}
» the root symbol is S
» Σ = {a, b, c}
» δ consists of the following rules:
S -> b
S -> XbY
X -> a
X -> aX
Y -> c
Y -> Yc
» Some sample derivations:
¢ S -> b
¢ S -> XbY -> abY -> abc
¢ S -> XbY -> aXbY -> aaXbY -> aaabY -> aaabc
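The derivations above can be reproduced mechanically. The following is a minimal sketch, assuming the six rules of G with left-hand sides S, X, and Y; the function name `generate` and the length bound are illustrative choices, not part of the slides. It rewrites the leftmost non-terminal breadth-first and collects every sentence made only of terminals.

```python
from collections import deque

# Productions of the sample grammar G; uppercase letters are non-terminals.
RULES = {
    "S": ["b", "XbY"],
    "X": ["a", "aX"],
    "Y": ["c", "Yc"],
}

def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    seen, result = {"S"}, set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal in the sentential form, if any.
        pos = next((i for i, ch in enumerate(form) if ch in RULES), None)
        if pos is None:
            result.add(form)          # only terminals left: a sentence
            continue
        for rhs in RULES[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # No rule of G shrinks a form, so over-long forms can be pruned.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(result, key=lambda s: (len(s), s))

print(generate(5))
```

The pruning step relies on the fact that no production of G shortens a sentential form; for a grammar with ε-productions a different bound would be needed.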




The Chomsky Hierarchy 


= Regular grammars (Type 3) 
» all productions can be written in the form: N ::= TN 
» one non-terminal on left side; at most one on right 
= Context-free grammars (Type 2) 
» all productions can be written in the form: N ::= XYZ
» one non-terminal on the left-hand side; mixture on right
= Context-sensitive grammars (Type 1) 
» number of symbols on the left is no greater than on the 
right 
» no production shrinks the size of the sentential form 
= Type-0 grammars 
» no restrictions 


Regular Expressions (1/3)


= An alternate way of describing a regular language is
with regular expressions
= We say that a regular expression R denotes the language [[R]]
= Recall that a language is a set of strings
= Basic regular expressions:
» ε denotes {ε}, the language containing only the empty string
» a character x, where x ∈ Σ, denotes {x}
» (sequencing) a sequence of two regular expressions RS
denotes {αβ | α ∈ [[R]], β ∈ [[S]]}
» (alternation) R|S denotes [[R]] ∪ [[S]]
» (Kleene star) R* denotes the set of strings which are
concatenations of zero or more strings from [[R]]
» parentheses are used for grouping
» Shorthands:
¢ R? = ε | R
¢ R+ = RR*


Regular Expressions (2/3) 


= A regular expression is one of the
following:

» A character

» The empty string, denoted by ε

» Two regular expressions concatenated

» Two regular expressions separated by |
(i.e., or)

» A regular expression followed by the Kleene
star (concatenation of zero or more strings)


Regular Expressions (3/3)


= Numerical literals in Pascal may be
generated by the following:


digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
unsigned_integer → digit digit*
unsigned_number → unsigned_integer ((. unsigned_integer) | ε)
                  (((e | E) (+ | - | ε) unsigned_integer) | ε)
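As a rough illustration, the three rules for Pascal numeric literals can be transcribed almost symbol for symbol into Python's `re` syntax. This is a sketch, not the course's own code: the names `UNSIGNED_NUMBER` and `is_unsigned_number` are made up here, and ε is written as an empty alternative.

```python
import re

# Each pattern mirrors one rule; an empty alternative "(...|)" plays the role of ε.
DIGIT = r"[0-9]"
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"
UNSIGNED_NUMBER = (
    rf"{UNSIGNED_INTEGER}"
    rf"((\.{UNSIGNED_INTEGER})|)"               # optional fraction
    rf"(((e|E)(\+|-|){UNSIGNED_INTEGER})|)"     # optional exponent
)
NUMBER_RE = re.compile(UNSIGNED_NUMBER)

def is_unsigned_number(text):
    """True if the whole of text is generated by unsigned_number."""
    return NUMBER_RE.fullmatch(text) is not None

for sample in ["137", "3.14", "6.022e23", "1E-9", ".5", "3."]:
    print(sample, is_unsigned_number(sample))
```

In everyday code the same language is usually written more compactly as `\d+(\.\d+)?([eE][+-]?\d+)?`; the long form is kept here only to show the correspondence with the rules.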


Regular Grammar Example


= A grammar for floating point numbers:
» Float ::= Digits | Digits . Digits
» Digits ::= Digit | Digit Digits
» Digit ::= 0|1|2|3|4|5|6|7|8|9
= A regular expression for floating point numbers:
» (0|1|2|3|4|5|6|7|8|9)+(.(0|1|2|3|4|5|6|7|8|9)+)?
= Perl offers some shorthands:
» [0-9]+(\.[0-9]+)?
or
» \d+(\.\d+)?


Lexical Issues


= Lexical: formation of words or tokens
= Tokens are the basic building blocks of programs:
» keywords (begin, end, while)
» identifiers (myVariable, yourType)
» numbers (137, 6.022e23)
» symbols (+, ...)
» string literals ("Hello world")
= Described (mainly) by regular grammars
= Terminals are characters. Some choices:
» character set: ASCII, Latin-1, ISO646, Unicode, etc.
» is case significant?
= Is indentation significant?
» Python, Occam, Haskell


Example: identifiers


Id ::= Letter IdRest
IdRest ::= ε | Letter IdRest | Digit IdRest


Missing from above grammar: limit of identifier length


Other issues: international characters, case-sensitivity, limit of identifier length
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Context-Free Grammars (1/7) 


= BNF: notation for context-free grammars
» (BNF = Backus-Naur Form) Some conventional
abbreviations:
¢ alternation: Symb ::= Letter | Digit
¢ repetition: Id ::= Letter {Symb}
or we can use a Kleene star: Id ::= Letter Symb*
for one or more repetitions: Int ::= Digit+
¢ option: Num ::= Digit+ [. Digit*]
= abbreviations do not add to expressive power
of grammar


= need convention for meta-symbols: what if "|"
is in the language?


Context-Free Grammars (2/7) 


= The notation for context-free grammars
(CFG) is sometimes called Backus-Naur
Form (BNF)

= A CFG consists of
» A set of terminals T
» A set of non-terminals N
» A start symbol S (a non-terminal)
» A set of productions


Context-Free Grammars (3/7) 


= Expression grammar with precedence 
and associativity 


1. expr → term | expr add_op term
2. term → factor | term mult_op factor
3. factor → id | number | - factor | ( expr )
4. add_op → + | -
5. mult_op → * | /


Context-Free Grammars (4/7) 


= A parse tree describes the grammatical 
structure of a sentence 
» root of tree is root symbol of grammar 
» leaf nodes are terminal symbols 
» internal nodes are non-terminal symbols 


» an internal node and its descendants
correspond to some production for that
non-terminal


» top-down tree traversal represents the 
process of generating the given sentence 
from the grammar 


» construction of tree from sentence is parsing 


Context-Free Grammars (5/7)


= Ambiguity:
» If the parse tree for a sentence is not unique, the
grammar is ambiguous:
E ::= E + E | E * E | Id
» Two possible parse trees for "A + B * C":
¢ ((A + B) * C)
¢ (A + (B * C))
» One solution: rearrange grammar:


E ::= E + T | T
T ::= T * Id | Id
» Harder problems: disambiguate these (courtesy of
Ada):


¢ function_call ::= name (expression_list)
¢ indexed_component ::= name (index_list)
¢ type_conversion ::= name (expression)


Context-Free Grammars (6/7) 


= Parse tree for expression grammar (with 
precedence) for 3 + 4 * 5 


                  expr
          /        |         \
       expr      add_op      term
        |          |      /    |     \
       term        +   term mult_op factor
        |               |      |      |
      factor         factor    *   number (5)
        |               |
    number (3)      number (4)


Context-Free Grammars (7/7) 


= Parse tree for expression grammar (with 
left associativity) for 10 - 4 - 3 


                        expr
               /          |        \
            expr        add_op     term
        /     |     \     |          |
     expr  add_op  term   -        factor
       |      |      |               |
     term     -   factor         number (3)
       |             |
    factor       number (4)
       |
   number (10)


Scanning (1/11)


= Recall scanner is responsible for 
» tokenizing source 
» removing comments 
» (often) dealing with pragmas (i.e., significant 
comments) 
» saving text of identifiers, numbers, strings 


» saving source locations (file, line, column) for
error messages


Scanning (2/11)


= Suppose we are building an ad-hoc (hand-written)
scanner for Pascal:
» We read the characters one at a time with look-ahead
= If it is one of the one-character tokens
( ) [ ] , ; = + - etc.
we announce that token
= If it is a ., we look at the next character
» If that is a dot, we announce ..
» Otherwise, we announce . and reuse the look-ahead


Scanning (3/11)


= If it is a <, we look at the next character
» if that is a = we announce <=
» otherwise, we announce < and reuse the
look-ahead, etc.

= If it is a letter, we keep reading letters and
digits and maybe underscores until we
can't anymore
» then we check to see if it is a reserved word


Scanning (4/11)


= If it is a digit, we keep reading until we find
a non-digit
» if that is not a . we announce an integer
» otherwise, we keep looking for a real number


» if the character after the . is not a digit we
announce an integer and reuse the . and the
look-ahead
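The ad-hoc rules from the last three slides can be put together in a short hand-written scanner. This is a sketch under stated assumptions: the keyword set, the set of one-character tokens, and the token kind names are illustrative and cover only a small Pascal-like fragment (no comments, strings, or exponents).

```python
KEYWORDS = {"begin", "end", "while"}   # illustrative subset only
ONE_CHAR = set("()[],;=+-")            # assumed one-character tokens

def scan(src):
    """Return a list of (kind, text) tokens for a Pascal-like fragment."""
    tokens, i, n = [], 0, len(src)
    while i < n:
        ch = src[i]
        nxt = src[i + 1] if i + 1 < n else ""   # one character of look-ahead
        if ch.isspace():
            i += 1
        elif ch in ONE_CHAR:
            tokens.append(("symbol", ch))
            i += 1
        elif ch in ".<":
            # '.' may start '..', '<' may start '<='; otherwise reuse the look-ahead.
            two = ch + nxt
            text = two if two in ("..", "<=") else ch
            tokens.append(("symbol", text))
            i += len(text)
        elif ch.isalpha():
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            word = src[i:j]
            kind = "keyword" if word.lower() in KEYWORDS else "id"
            tokens.append((kind, word))
            i = j
        elif ch.isdigit():
            j = i
            while j < n and src[j].isdigit():
                j += 1
            # Only continue into a real if '.' is followed by a digit.
            if j + 1 < n and src[j] == "." and src[j + 1].isdigit():
                j += 1
                while j < n and src[j].isdigit():
                    j += 1
                tokens.append(("real", src[i:j]))
            else:
                tokens.append(("int", src[i:j]))
            i = j
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens

print(scan("while x <= 3..5"))
```

Note that the digit branch peeks two characters ahead, which is what lets `3..5` come out as an integer, a `..` symbol, and another integer instead of a malformed real.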


Scanning (5/11)


[Figure: pictorial representation of a scanner for calculator tokens, in the form of a finite automaton. From the start state, space, tab, and newline loop back; further states recognize comments, lparen, rparen, plus, minus, times, assign, number (digit sequences), and id or keyword (a letter followed by letters and digits).]


Scanning (6/11)


= This is a deterministic finite automaton 
(DFA) 


» Lex, scangen, etc. build these things 
automatically from a set of regular 
expressions 


» Specifically, they construct a machine 
that accepts the language 
identifier | int const 
| real const | comment | symbol 


Scanning (7/11)


= We run the machine over and over to get
one token after another
» Nearly universal rule:


¢ always take the longest possible token from the
input
thus foobar is foobar and never f or foo or foob


¢ more to the point, 3.14159 is a real const and
never 3, ., and 14159


= Regular expressions "generate" a regular
language; DFAs "recognize" it


Scanning (8/11)


= Scanners tend to be built three ways 
» ad-hoc 
» semi-mechanical pure DFA 
(usually realized as nested case statements) 
» table-driven DFA 


= Ad-hoc generally yields the fastest, most 
compact code by doing lots of special- 
purpose things, though good 
automatically-generated scanners come 
very close 


Scanning (9/11)


= Writing a pure DFA as a set of nested
case statements is a surprisingly useful
programming technique
» though it's often easier to use perl, awk, sed
» for details see textbook's Figure 2.11

= Table-driven DFA is what lex and 
scangen produce 
» lex (flex) in the form of C code 


» scangen in the form of numeric tables and a 
separate driver (for details see textbook’s 
Figure 2.12) 


Scanning (10/11)


= Note that the rule about longest-possible
tokens means you return only when the next
character can't be used to continue the
current token
» the next character will generally need to be saved
for the next token
= In some cases, you may need to peek at
more than one character of look-ahead in
order to know whether to proceed
» In Pascal, for example, when you have a 3 and
you see a dot
¢ do you proceed (in hopes of getting 3.14)?
or
¢ do you stop (in fear of getting 3..5)?


Scanning (11/11)


= In messier cases, you may not be able to
get by with any fixed amount of look-ahead.
In Fortran, for example, we have
DO 5 I = 1,25    loop
DO 5 I = 1.25    assignment


= Here, we need to remember we were in a
potentially final state, and save enough
information that we can back up to it, if we
get stuck later
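The idea of remembering the last potentially final state can be sketched with a small table-driven DFA. The example below is an assumption-laden illustration: it uses the simpler Pascal case (3.14 versus 3..5) instead of Fortran, and the state names, character classes, and the `TABLE` layout are invented for this sketch and are not what lex or scangen would emit.

```python
# Transition table of a small DFA: state -> {character class -> next state}.
TABLE = {
    "start":   {"digit": "int", "letter": "id", "dot": "dot"},
    "int":     {"digit": "int", "dot": "int_dot"},
    "int_dot": {"digit": "real"},      # not final: "3." alone is not a token
    "real":    {"digit": "real"},
    "id":      {"letter": "id", "digit": "id"},
    "dot":     {"dot": "dotdot"},
    "dotdot":  {},
}
# Final states and the token kind each one announces.
FINAL = {"int": "int", "real": "real", "id": "id", "dot": ".", "dotdot": ".."}

def char_class(ch):
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "dot" if ch == "." else "other"

def next_token(src, start):
    """Longest match from src[start:]; returns (kind, text, new_position)."""
    state, pos = "start", start
    last_final = None                  # (kind, end) of the last final state seen
    while pos < len(src):
        state = TABLE[state].get(char_class(src[pos]))
        if state is None:              # stuck: fall back to last_final below
            break
        pos += 1
        if state in FINAL:
            last_final = (FINAL[state], pos)
    if last_final is None:
        raise ValueError(f"no token at offset {start}")
    kind, end = last_final
    return kind, src[start:end], end   # characters after `end` are rescanned

def tokenize(src):
    pos, out = 0, []
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        kind, text, pos = next_token(src, pos)
        out.append((kind, text))
    return out

print(tokenize("3..5"))
print(tokenize("3.14 foobar"))
```

On `3..5` the machine runs past the first dot into the non-final state `int_dot`, gets stuck on the second dot, and backs up to the position recorded when it was last in the final state `int`.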


Parsing (1/7) 


= Terminology: 
» context-free grammar (CFG) 


» symbols 
¢ terminals (tokens) 
¢ non-terminals 


» production 


» derivations (left-most and right-most - 
canonical) 


» parse trees 
» sentential form 


Parsing (2/7) 


= By analogy to RE and DFAs, a context- 
free grammar (CFG) is a generator for a 
context-free language (CFL) 
» a parser is a language recognizer

= There is an infinite number of grammars 
for every context-free language 
» not all grammars are created equal, however 


Parsing (3/7) 


= It turns out that for any CFG we can
create a parser that runs in O(n^3) time
= There are two well-known parsing
algorithms that permit this
» Earley's algorithm
» Cocke-Younger-Kasami (CYK) algorithm
= O(n^3) time is clearly unacceptable for a
parser in a compiler - too slow
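To make the cubic bound concrete, here is a minimal sketch of the CYK recognizer. It assumes the grammar is already in Chomsky normal form; the example grammar for the language a^n b^n (n >= 1) and the names `UNIT`, `PAIR`, and `cyk` are assumptions made for this illustration.

```python
def cyk(word, unit_rules, pair_rules, start="S"):
    """CYK recognizer for a grammar in Chomsky normal form.

    unit_rules: list of (A, terminal) for rules A -> terminal
    pair_rules: list of (A, (B, C)) for rules A -> B C
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][l - 1] = set of non-terminals deriving word[i:i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {lhs for lhs, t in unit_rules if t == ch}
    for length in range(2, n + 1):             # substring length
        for i in range(n - length + 1):        # substring start
            for split in range(1, length):     # length of the left part
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in pair_rules:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

# CNF grammar for a^n b^n: S -> A B | A T, T -> S B, A -> a, B -> b
UNIT = [("A", "a"), ("B", "b")]
PAIR = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B"))]

print(cyk("aabb", UNIT, PAIR), cyk("aab", UNIT, PAIR))
```

The three nested loops over length, start, and split point are where the n^3 factor comes from; the innermost loop over rules adds a factor proportional to the grammar size.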


Parsing (4/7)


= Fortunately, there are large classes of 
grammars for which we can build parsers 
that run in linear time 
» The two most important classes are called 

LL and LR 

= LL stands for 
'Left-to-right, Leftmost derivation’. 

= LR stands for 
'Left-to-right, Rightmost derivation’ 
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Parsing (5/7) 


= LL parsers are also called 'top-down', or
'predictive' parsers
= LR parsers are also called 'bottom-up', or
'shift-reduce' parsers
= There are several important sub-classes of
LR parsers
» SLR
» LALR


= We won't be going into detail on the
differences between them


Parsing (6/7) 


= Every LL(1) grammar is also LR(1), though 
right recursion in production tends to 
require very deep stacks and complicates 
semantic analysis 


= Every CFL that can be parsed
deterministically has an SLR(1) grammar
(which is LR(1))

= Every deterministic CFL with the prefix
property (no valid string is a prefix of
another valid string) has an LR(0) grammar


Parsing (7/7) 


= You commonly see LL or LR (or 
whatever) written with a number in 
parentheses after it 
» This number indicates how many tokens of 
look-ahead are required in order to parse 
» Almost all real compilers use one token of 
look-ahead 
= The expression grammar (with 
precedence and associativity) you saw 
before is LR(1), but not LL(1) 


LL Parsing (1/23)


Here is an LL(1) grammar (Fig 2.15):


1. program → stmt_list $$
2. stmt_list → stmt stmt_list
3.           | ε
4. stmt → id := expr
5.      | read id
6.      | write expr
7. expr → term term_tail
8. term_tail → add_op term term_tail
9.           | ε
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LL(1) grammar (continued) 


10. term → factor fact_tail
11. fact_tail → mult_op factor fact_tail
12.           | ε
13. factor → ( expr )
14.        | id
15.        | number
16. add_op → +
17.        | -
18. mult_op → *
19.         | /
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LL Parsing (3/23) @ 


= Like the bottom-up grammar, this one 
captures associativity and precedence, 
but most people don't find it as pretty 
» for one thing, the operands of a given 
operator aren't in a RHS together! 
» however, the simplicity of the parsing 
algorithm makes up for this weakness 
= How do we parse a string with this 
grammar? 
» by building the parse tree incrementally 


LL Parsing (4/23)


= Example (average program)
read A
read B
sum := A + B
write sum
write sum / 2
= We start at the top and predict needed
productions on the basis of the current left-most
non-terminal in the tree and the current
input token
LL Parsing (5/23)


= Parse tree for the average program (children are indented under their parent):


program
  stmt_list
    stmt
      read id(A)
    stmt_list
      stmt
        read id(B)
      stmt_list
        stmt
          id(sum) := expr
            term
              factor: id(A)
              factor_tail: ε
            term_tail
              add_op: +
              term
                factor: id(B)
                factor_tail: ε
              term_tail: ε
        stmt_list
          stmt
            write expr
              term
                factor: id(sum)
                factor_tail: ε
              term_tail: ε
          stmt_list
            stmt
              write expr
                term
                  factor: id(sum)
                  factor_tail
                    mult_op: /
                    factor: number(2)
                    factor_tail: ε
                term_tail: ε
            stmt_list: ε
  $$
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LL Parsing (6/23) @ 


= Table-driven LL parsing: you have a big 
loop in which you repeatedly look up an 
action in a two-dimensional table based 
on current leftmost non-terminal and 
current input token. The actions are 
(1) match a terminal 
(2) predict a production 
(3) announce a syntax error 


LL Parsing (7/23)


= LL(1) parse table for the calculator language
(entries are production numbers; - indicates an error)


Top-of-stack                    Current input token
nonterminal   id  number  read  write  :=   (    )    +    -    *    /    $$
program        1    -      1     1     -    -    -    -    -    -    -    1
stmt_list      2    -      2     2     -    -    -    -    -    -    -    3
stmt           4    -      5     6     -    -    -    -    -    -    -    -
expr           7    7      -     -     -    7    -    -    -    -    -    -
term_tail      9    -      9     9     -    -    9    8    8    -    -    9
term          10   10      -     -     -   10    -    -    -    -    -    -
factor_tail   12    -     12    12     -    -   12   12   12   11   11   12
factor        14   15      -     -     -   13    -    -    -    -    -    -
add_op         -    -      -     -     -    -    -   16   17    -    -    -
mult_op        -    -      -     -     -    -    -    -    -   18   19    -


LL Parsing (8/23) 


= To keep track of the left-most non- 
terminal, you push the as-yet-unseen 
portions of productions onto a stack 
» for details see Figure 2.20 


= The key thing to keep in mind is that the
stack contains all the stuff you expect to
see between now and the end of the
program
» what you predict you will see 
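The table-driven loop with its prediction stack can be sketched as follows. This is an illustrative sketch, not the textbook's driver: the production numbering and predict sets follow Figure 2.22 of these slides (which spells the non-terminal `factor_tail`), tokens are assumed to arrive as a list of token-kind strings ending in `$$`, and the function name `parse` is made up here.

```python
PRODUCTIONS = {
    1: ("program", ["stmt_list", "$$"]),
    2: ("stmt_list", ["stmt", "stmt_list"]),
    3: ("stmt_list", []),
    4: ("stmt", ["id", ":=", "expr"]),
    5: ("stmt", ["read", "id"]),
    6: ("stmt", ["write", "expr"]),
    7: ("expr", ["term", "term_tail"]),
    8: ("term_tail", ["add_op", "term", "term_tail"]),
    9: ("term_tail", []),
    10: ("term", ["factor", "factor_tail"]),
    11: ("factor_tail", ["mult_op", "factor", "factor_tail"]),
    12: ("factor_tail", []),
    13: ("factor", ["(", "expr", ")"]),
    14: ("factor", ["id"]),
    15: ("factor", ["number"]),
    16: ("add_op", ["+"]),
    17: ("add_op", ["-"]),
    18: ("mult_op", ["*"]),
    19: ("mult_op", ["/"]),
}
# Predict sets per production, as space-separated token kinds.
PREDICT = {
    1: "id read write $$", 2: "id read write", 3: "$$",
    4: "id", 5: "read", 6: "write", 7: "( id number",
    8: "+ -", 9: ") id read write $$", 10: "( id number",
    11: "* /", 12: "+ - ) id read write $$",
    13: "(", 14: "id", 15: "number", 16: "+", 17: "-", 18: "*", 19: "/",
}
# TABLE[(non-terminal, token)] = production to predict
TABLE = {(PRODUCTIONS[p][0], tok): p
         for p, toks in PREDICT.items() for tok in toks.split()}
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS.values()}

def parse(tokens):
    """Return the list of predicted production numbers, or raise SyntaxError."""
    stack, pos, used = ["program"], 0, []
    while stack:
        top = stack.pop()
        tok = tokens[pos] if pos < len(tokens) else None
        if top in NONTERMINALS:
            prod = TABLE.get((top, tok))
            if prod is None:                       # (3) announce a syntax error
                raise SyntaxError(f"unexpected {tok!r} while expanding {top}")
            used.append(prod)                      # (2) predict a production
            stack.extend(reversed(PRODUCTIONS[prod][1]))
        elif top == tok:                           # (1) match a terminal
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {tok!r}")
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return used

print(parse(["read", "id", "write", "id", "$$"]))
```

The right-hand side is pushed in reverse so that its first symbol ends up on top of the stack, which is what makes the stack read as "everything still expected".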


LL Parsing (9/23)


= Problems trying to make a grammar LL(1)
» left recursion
¢ example:
id_list → id | id_list , id
equivalently
id_list → id id_list_tail
id_list_tail → , id id_list_tail
             | epsilon


¢ we can get rid of all left recursion mechanically in
any grammar
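The mechanical transformation for the immediate case (A → A α | β becomes A → β A_tail, A_tail → α A_tail | ε) can be sketched in a few lines. The function name, the list-of-lists grammar representation, and the `_tail` suffix are assumptions of this sketch; indirect left recursion needs the fuller algorithm and is not handled here.

```python
def eliminate_immediate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A alpha | beta as A -> beta A_tail, A_tail -> alpha A_tail | epsilon.

    alternatives is a list of right-hand sides, each a list of symbols;
    the empty list stands for epsilon.
    """
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}      # nothing to do
    tail = nonterminal + "_tail"                # hypothetical naming convention
    return {
        nonterminal: [beta + [tail] for beta in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }

print(eliminate_immediate_left_recursion(
    "id_list", [["id"], ["id_list", ",", "id"]]))
```

Applied to the id_list example, this yields the same two productions shown on the slide.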
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LL Parsing (10/23) ee 


= Problems trying to make a grammar
LL(1)
» common prefixes: another thing that LL
parsers can't handle
¢ solved by "left-factoring"


¢ example:
stmt → id := expr | id ( arg_list )
equivalently
stmt → id id_stmt_tail
id_stmt_tail → := expr
             | ( arg_list )


¢ we can left-factor mechanically
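One step of left factoring can also be written down directly. In this sketch only a shared first symbol is factored out, which is enough for the stmt example; the function name and the `<first>_<nonterminal>_tail` naming scheme are assumptions chosen to reproduce the name used on the slide.

```python
def left_factor(nonterminal, alternatives):
    """Factor out a shared first symbol (one step of left factoring).

    alternatives is a list of right-hand sides, each a list of symbols.
    """
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else None           # None groups epsilon alternatives
        groups.setdefault(key, []).append(alt)
    result = {nonterminal: []}
    for first, alts in groups.items():
        if first is None or len(alts) == 1:
            result[nonterminal].extend(alts)    # no common prefix to factor
            continue
        tail = f"{first}_{nonterminal}_tail"    # hypothetical naming convention
        result[nonterminal].append([first, tail])
        result[tail] = [alt[1:] for alt in alts]
    return result

print(left_factor("stmt", [["id", ":=", "expr"], ["id", "(", "arg_list", ")"]]))
```

Repeating the step on the newly created tail non-terminals would factor longer common prefixes.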
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LL Parsing (11/23) @ 


= Note that eliminating left recursion and 
common prefixes does NOT make a 
grammar LL 
» there are infinitely many non-LL 
LANGUAGES, and the mechanical 
transformations work on them just fine 


» the few that arise in practice, however, can 
generally be handled with kludges 


LL Parsing (12/23) 


= Problems trying to make a grammar LL(1) 


» the "dangling else" problem prevents
grammars from being LL(1) (or in fact LL(k)
for any k)


» the following natural grammar fragment is
ambiguous (Pascal)


stmt → if cond then_clause else_clause
     | other_stuff
then_clause → then stmt
else_clause → else stmt
            | epsilon
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LL Parsing (13/23) at 


= Consider: S ::= if E then S
S ::= if E then S else S
= The sentence
if E1 then if E2 then S1 else S2
is ambiguous (Which then does else S2 match?)
= Solutions:
» Pascal rule: else matches most recent if
» grammatical solution: different productions for balanced
and unbalanced if-statements
» grammatical solution: introduce explicit end-marker


= The general ambiguity problem is unsolvable
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LL Parsing (14/23) 


= The less natural grammar fragment can be
parsed bottom-up but not top-down


stmt → balanced_stmt | unbalanced_stmt
balanced_stmt → if cond then balanced_stmt
                else balanced_stmt
              | other_stuff
unbalanced_stmt → if cond then stmt
                | if cond then balanced_stmt
                  else unbalanced_stmt
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LL Parsing (15/23) @ 


= The usual approach, whether top-down
OR bottom-up, is to use the ambiguous
grammar together with a disambiguating
rule that says
» else goes with the closest then or


» more generally, the first of two possible 
productions is the one to predict (or reduce) 


LL Parsing (16/23) 


= Better yet, languages (since Pascal) generally
employ explicit end-markers, which eliminate
this problem
= In Modula-2, for example, one says:
if A = B then
    if C = D then E := F end
else
    ...
end
= Ada says 'end if'; other languages say 'fi'
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LL Parsing (17/23) 


= One problem with end markers is that
they tend to bunch up. In Pascal you say


if A = B then ...
else if A = C then ...
else if A = D then ...
else if A = E then ...
else ...;


= With end markers this becomes


if A = B then ...
else if A = C then ...
else if A = D then ...
else if A = E then ...
else ...;
end; end; end; end;
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LL Parsing (18/23) 


= The algorithm to build predict sets is 
tedious (for a "real" sized grammar), but 
relatively simple 


= It consists of three stages: 
» (1) compute FIRST sets for symbols 
» (2) compute FOLLOW sets for non-terminals 
(this requires computing FIRST sets for some 
strings) 
» (3) compute predict sets or table for all
productions
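The three stages can be sketched as a fixed-point computation over the LL(1) calculator grammar. This is a minimal sketch under stated assumptions: the grammar is written out with the non-terminal names used in Figure 2.22, ε is represented by the string "ε", and `compute_sets` is an illustrative name. Unlike the figure, this sketch leaves FOLLOW(program) empty and does not track FOLLOW for tokens, since neither affects the predict sets.

```python
EPS = "ε"
GRAMMAR = [  # productions 1..19 of the LL(1) calculator grammar, in order
    ("program", ["stmt_list", "$$"]),
    ("stmt_list", ["stmt", "stmt_list"]), ("stmt_list", []),
    ("stmt", ["id", ":=", "expr"]), ("stmt", ["read", "id"]),
    ("stmt", ["write", "expr"]),
    ("expr", ["term", "term_tail"]),
    ("term_tail", ["add_op", "term", "term_tail"]), ("term_tail", []),
    ("term", ["factor", "factor_tail"]),
    ("factor_tail", ["mult_op", "factor", "factor_tail"]), ("factor_tail", []),
    ("factor", ["(", "expr", ")"]), ("factor", ["id"]), ("factor", ["number"]),
    ("add_op", ["+"]), ("add_op", ["-"]),
    ("mult_op", ["*"]), ("mult_op", ["/"]),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def first_of_string(symbols, first):
    """FIRST of a string of symbols, given current FIRST sets of non-terminals."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in NONTERMINALS else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)                       # every symbol can derive epsilon
    return out

def compute_sets():
    first = {nt: set() for nt in NONTERMINALS}
    follow = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:                     # iterate until nothing grows any more
        changed = False
        for lhs, rhs in GRAMMAR:
            new_first = first_of_string(rhs, first)
            if not new_first <= first[lhs]:
                first[lhs] |= new_first
                changed = True
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                rest = first_of_string(rhs[i + 1:], first)
                new_follow = rest - {EPS}
                if EPS in rest:        # what follows lhs can follow sym too
                    new_follow |= follow[lhs]
                if not new_follow <= follow[sym]:
                    follow[sym] |= new_follow
                    changed = True
    predict = []
    for lhs, rhs in GRAMMAR:
        f = first_of_string(rhs, first)
        predict.append((f - {EPS}) | (follow[lhs] if EPS in f else set()))
    return first, follow, predict

first, follow, predict = compute_sets()
for number, ((lhs, rhs), p) in enumerate(zip(GRAMMAR, predict), start=1):
    print(number, lhs, "->", " ".join(rhs) or EPS, sorted(p))
```

FIRST and FOLLOW are grown together in one loop here for brevity; computing FIRST to completion before starting FOLLOW, as the slide's staging suggests, gives the same result.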


LL Parsing (19/23) 


= It is conventional in general discussions of 
grammars to use 


» lower case letters near the beginning of the alphabet 
for terminals 


» lower case letters near the end of the alphabet for 
strings of terminals 


» upper case letters near the beginning of the alphabet 
for non-terminals 


» upper case letters near the end of the alphabet for 
arbitrary symbols 


» Greek letters for arbitrary strings of symbols
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LL Parsing (20/23) er 


= Algorithm First/Follow/Predict:


- FIRST(α) == {a : α →* a β}
              ∪ (if α →* ε THEN {ε} ELSE NULL)


- FOLLOW(A) == {a : S →+ α A a β}
              ∪ (if S →* α A THEN {ε} ELSE NULL)


- PREDICT(A → X1 ... Xm) == (FIRST(X1 ... Xm) - {ε})
              ∪ (if X1 ... Xm →* ε THEN FOLLOW(A) ELSE NULL)


= Details following...
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LL Parsing (21/23) pe 


Production                                   "Obvious" fact

program → stmt_list $$                       $$ ∈ FOLLOW(stmt_list), ε ∈ FOLLOW($$), and ε ∈ FOLLOW(program)
stmt_list → stmt stmt_list
stmt_list → ε                                ε ∈ FIRST(stmt_list)
stmt → id := expr                            id ∈ FIRST(stmt) and := ∈ FOLLOW(id)
stmt → read id                               read ∈ FIRST(stmt) and id ∈ FOLLOW(read)
stmt → write expr                            write ∈ FIRST(stmt)
expr → term term_tail
term_tail → add_op term term_tail
term_tail → ε                                ε ∈ FIRST(term_tail)
term → factor factor_tail
factor_tail → mult_op factor factor_tail
factor_tail → ε                              ε ∈ FIRST(factor_tail)
factor → ( expr )                            ( ∈ FIRST(factor) and ) ∈ FOLLOW(expr)
factor → id                                  id ∈ FIRST(factor)
factor → number                              number ∈ FIRST(factor)
add_op → +                                   + ∈ FIRST(add_op)
add_op → -                                   - ∈ FIRST(add_op)
mult_op → *                                  * ∈ FIRST(mult_op)
mult_op → /                                  / ∈ FIRST(mult_op)


Figure 2.21: "Obvious" facts about the LL(1) calculator grammar.


LL Parsing (22/23) 


FIRST
program {id, read, write, $$}
stmt_list {id, read, write, ε}
stmt {id, read, write}
expr {(, id, number}
term_tail {+, -, ε}
term {(, id, number}
factor_tail {*, /, ε}
factor {(, id, number}
add_op {+, -}
mult_op {*, /}
Also note that FIRST(a) = {a} ∀ tokens a.


FOLLOW
id {+, -, *, /, ), :=, id, read, write, $$}
number {+, -, *, /, ), id, read, write, $$}
read {id}
write {(, id, number}
( {(, id, number}
) {+, -, *, /, ), id, read, write, $$}
:= {(, id, number}
+ {(, id, number}
- {(, id, number}
* {(, id, number}
/ {(, id, number}
$$ {ε}
program {ε}
stmt_list {$$}
stmt {id, read, write, $$}
expr {), id, read, write, $$}
term_tail {), id, read, write, $$}
term {+, -, ), id, read, write, $$}
factor_tail {+, -, ), id, read, write, $$}
factor {+, -, *, /, ), id, read, write, $$}
add_op {(, id, number}
mult_op {(, id, number}


PREDICT
1 program → stmt_list $$ {id, read, write, $$}
2 stmt_list → stmt stmt_list {id, read, write}
3 stmt_list → ε {$$}
4 stmt → id := expr {id}
5 stmt → read id {read}
6 stmt → write expr {write}
7 expr → term term_tail {(, id, number}
8 term_tail → add_op term term_tail {+, -}
9 term_tail → ε {), id, read, write, $$}
10 term → factor factor_tail {(, id, number}
11 factor_tail → mult_op factor factor_tail {*, /}
12 factor_tail → ε {+, -, ), id, read, write, $$}
13 factor → ( expr ) {(}
14 factor → id {id}
15 factor → number {number}
16 add_op → + {+}
17 add_op → - {-}
18 mult_op → * {*}
19 mult_op → / {/}


Figure 2.22: FIRST, FOLLOW, and PREDICT sets for the calculator language.
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LL Parsing (23/23) @ 


= If any token belongs to the predict set
of more than one production with the
same LHS, then the grammar is not
LL(1)
= A conflict can arise because
» the same token can begin more than one
RHS
» it can begin one RHS and can also appear
after the LHS in some valid program, and
one possible RHS is ε


LR Parsing (1/11)


= LR parsers are almost always table- 
driven: 


» like a table-driven LL parser, an LR parser 
uses a big loop in which it repeatedly 
inspects a two-dimensional table to find out 
what action to take 


» unlike the LL parser, however, the LR driver
has non-trivial state (like a DFA), and the
table is indexed by current input token and
current state


» the stack contains a record of what has been
seen SO FAR (NOT what is expected)


LR Parsing (2/11)


= A scanner is a DFA 
» it can be specified with a state diagram 
= An LL or LR parser is a PDA 
» Earley's & CYK algorithms do NOT use PDAs
» a PDA can be specified with a state diagram 
and a stack 


¢ the state diagram looks just like a DFA state 
diagram, except the arcs are labeled with <input 
symbol, top-of-stack symbol> pairs, and in 
addition to moving to a new state the PDA has the 
option of pushing or popping a finite number of 
symbols onto/off the stack 
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LR Parsing (3/11) @ 


= An LL(1) PDA has only one state!

» well, actually two; it needs a second one to 
accept with, but that's all (it's pretty simple) 

» all the arcs are self loops; the only difference 
between them is the choice of whether to 
push or pop 

» the final state is reached by a transition that 
sees EOF on the input and the stack 
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LR Parsing (4/11) @ 


= An SLR/LALR/LR PDA has multiple states
» it is a "recognizer," not a "predictor"
» it builds a parse tree from the bottom up
» the states keep track of which productions we might
be in the middle of
= Parsing with the Characteristic Finite State
Machine (CFSM) is based on
» Shift
» Reduce
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LR Parsing (5/11) 


To illustrate LR parsing, consider the
grammar (Figure 2.24, Page 73):


1. program → stmt_list $$
2. stmt_list → stmt_list stmt
3.           | stmt
4. stmt → id := expr
5.      | read id
6.      | write expr
7. expr → term
8.      | expr add_op term
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LR Parsing (6/11) : 


LR grammar (continued):


9. term → factor
10.     | term mult_op factor
11. factor → ( expr )
12.        | id
13.        | number
14. add_op → +
15.        | -
16. mult_op → *
17.         | /
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LR Parsing (7/11) @ 


= This grammar is SLR(1), a particularly
nice class of bottom-up grammar
» it isn't exactly what we saw originally
» we've eliminated the epsilon production to
simplify the presentation
= For details on table-driven SLR(1)
parsing, see the following slides


LR Parsing (8/11) 


State 0
program → . stmt_list $$
----
stmt_list → . stmt_list stmt
stmt_list → . stmt
stmt → . id := expr
stmt → . read id
stmt → . write expr
Transitions:
on stmt_list shift and goto 2
on stmt shift and reduce (pop 1 state, push stmt_list on input)
on id shift and goto 3
on read shift and goto 1
on write shift and goto 4

State 1
stmt → read . id
Transitions:
on id shift and reduce (pop 2 states, push stmt on input)

State 2
program → stmt_list . $$
stmt_list → stmt_list . stmt
----
stmt → . id := expr
stmt → . read id
stmt → . write expr
Transitions:
on $$ shift and reduce (pop 2 states, push program on input)
on stmt shift and reduce (pop 2 states, push stmt_list on input)
on id shift and goto 3
on read shift and goto 1
on write shift and goto 4

State 3
stmt → id . := expr
Transitions:
on := shift and goto 5

State 4
stmt → write . expr
----
expr → . term
expr → . expr add_op term
term → . factor
term → . term mult_op factor
factor → . ( expr )
factor → . id
factor → . number
Transitions:
on expr shift and goto 6
on term shift and goto 7
on factor shift and reduce (pop 1 state, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)

State 5
stmt → id := . expr
----
expr → . term
expr → . expr add_op term
term → . factor
term → . term mult_op factor
factor → . ( expr )
factor → . id
factor → . number
Transitions:
on expr shift and goto 9
on term shift and goto 7
on factor shift and reduce (pop 1 state, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)

State 6
stmt → write expr .
expr → expr . add_op term
----
add_op → . +
add_op → . -
Transitions:
on FOLLOW(stmt) = {id, read, write, $$} reduce (pop 2 states, push stmt on input)
on add_op shift and goto 10
on + shift and reduce (pop 1 state, push add_op on input)
on - shift and reduce (pop 1 state, push add_op on input)

State 7
expr → term .
term → term . mult_op factor
----
mult_op → . *
mult_op → . /
Transitions:
on FOLLOW(expr) = {id, read, write, $$, ), +, -} reduce (pop 1 state, push expr on input)
on mult_op shift and goto 11
on * shift and reduce (pop 1 state, push mult_op on input)
on / shift and reduce (pop 1 state, push mult_op on input)

State 8
factor → ( . expr )
----
expr → . term
expr → . expr add_op term
term → . factor
term → . term mult_op factor
factor → . ( expr )
factor → . id
factor → . number
Transitions:
on expr shift and goto 12
on term shift and goto 7
on factor shift and reduce (pop 1 state, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)

State 9
stmt → id := expr .
expr → expr . add_op term
----
add_op → . +
add_op → . -
Transitions:
on FOLLOW(stmt) = {id, read, write, $$} reduce (pop 3 states, push stmt on input)
on add_op shift and goto 10
on + shift and reduce (pop 1 state, push add_op on input)
on - shift and reduce (pop 1 state, push add_op on input)

State 10
expr → expr add_op . term
----
term → . factor
term → . term mult_op factor
factor → . ( expr )
factor → . id
factor → . number
Transitions:
on term shift and goto 13
on factor shift and reduce (pop 1 state, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)

State 11
term → term mult_op . factor
----
factor → . ( expr )
factor → . id
factor → . number
Transitions:
on factor shift and reduce (pop 3 states, push term on input)
on ( shift and goto 8
on id shift and reduce (pop 1 state, push factor on input)
on number shift and reduce (pop 1 state, push factor on input)

State 12
factor → ( expr . )
expr → expr . add_op term
----
add_op → . +
add_op → . -
Transitions:
on ) shift and reduce (pop 3 states, push factor on input)
on add_op shift and goto 10
on + shift and reduce (pop 1 state, push add_op on input)
on - shift and reduce (pop 1 state, push add_op on input)

State 13
expr → expr add_op term .
term → term . mult_op factor
----
mult_op → . *
mult_op → . /
Transitions:
on FOLLOW(expr) = {id, read, write, $$, ), +, -} reduce (pop 3 states, push expr on input)
on mult_op shift and goto 11
on * shift and reduce (pop 1 state, push mult_op on input)
on / shift and reduce (pop 1 state, push mult_op on input)


Figure 2.25: CFSM for the calculator grammar (Figure 2.24). Basis and closure
items in each state are separated by a horizontal rule. Trivial reduce-only states
have been eliminated by use of "shift and reduce" transitions.
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LR Parsing (9/11) 


Figure 2.26: Pictorial representation of the CFSM of Figure 2.25. Symbol names 
have been abbreviated for clarity. Reduce actions are not shown. 




LR Parsing (10/11)


Top-of-stack                            Current input symbol
state  sl   s    e    t    f    ao   mo   id   lit  r    w    :=   (    )    +    -    *    /    $$
0      s2   b3   -    -    -    -    -    s3   -    s1   s4   -    -    -    -    -    -    -    -
1      -    -    -    -    -    -    -    b5   -    -    -    -    -    -    -    -    -    -    -
2      -    b2   -    -    -    -    -    s3   -    s1   s4   -    -    -    -    -    -    -    b1
3      -    -    -    -    -    -    -    -    -    -    -    s5   -    -    -    -    -    -    -
4      -    -    s6   s7   b9   -    -    b12  b13  -    -    -    s8   -    -    -    -    -    -
5      -    -    s9   s7   b9   -    -    b12  b13  -    -    -    s8   -    -    -    -    -    -
6      -    -    -    -    -    s10  -    r6   -    r6   r6   -    -    -    b14  b15  -    -    r6
7      -    -    -    -    -    -    s11  r7   -    r7   r7   -    -    r7   r7   r7   b16  b17  r7
8      -    -    s12  s7   b9   -    -    b12  b13  -    -    -    s8   -    -    -    -    -    -
9      -    -    -    -    -    s10  -    r4   -    r4   r4   -    -    -    b14  b15  -    -    r4
10     -    -    -    s13  b9   -    -    b12  b13  -    -    -    s8   -    -    -    -    -    -
11     -    -    -    -    b10  -    -    b12  b13  -    -    -    s8   -    -    -    -    -    -
12     -    -    -    -    -    s10  -    -    -    -    -    -    -    b11  b14  b15  -    -    -
13     -    -    -    -    -    -    s11  r8   -    r8   r8   -    -    r8   r8   r8   b16  b17  r8
Figure 2.27: SLR(1) parse table for the calculator language. Table entries indicate 
whether to shift (s), reduce (r), or shift and then reduce (b). The accompanying number 
is the new state when shifting, or the production that has been recognized when (shifting 
and) reducing. Production numbers are given in Figure 2.24. Symbol names have been 
abbreviated for the sake of formatting. A dash indicates an error. An auxiliary table, not 
shown here, gives the left-hand side symbol and right-hand side length for each production. 
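A driver for this table fits in a few lines. The sketch below is illustrative rather than the textbook's code: the dictionary encoding of Figure 2.27, the helper names, and the convention of pushing the reduced left-hand side back onto the front of the input are assumptions based on the CFSM description; `PRODS` plays the role of the auxiliary table (left-hand side and right-hand side length per production of Figure 2.24).

```python
# Auxiliary table: production number -> (left-hand side, right-hand side length).
PRODS = {1: ("program", 2), 2: ("stmt_list", 2), 3: ("stmt_list", 1),
         4: ("stmt", 3), 5: ("stmt", 2), 6: ("stmt", 2), 7: ("expr", 1),
         8: ("expr", 3), 9: ("term", 1), 10: ("term", 3), 11: ("factor", 3),
         12: ("factor", 1), 13: ("factor", 1), 14: ("add_op", 1),
         15: ("add_op", 1), 16: ("mult_op", 1), 17: ("mult_op", 1)}

def reduce_on(tokens, prod):
    return {tok: f"r{prod}" for tok in tokens.split()}

# Entries shared by several states.
EXPR_START = {"term": "s7", "factor": "b9", "id": "b12", "number": "b13", "(": "s8"}
ADD = {"add_op": "s10", "+": "b14", "-": "b15"}
MUL = {"mult_op": "s11", "*": "b16", "/": "b17"}
FOLLOW_STMT = "id read write $$"
FOLLOW_EXPR = "id read write $$ ) + -"

# s = shift, r = reduce, b = shift and then reduce; a missing entry is an error.
TABLE = {
    0: {"stmt_list": "s2", "stmt": "b3", "id": "s3", "read": "s1", "write": "s4"},
    1: {"id": "b5"},
    2: {"stmt": "b2", "id": "s3", "read": "s1", "write": "s4", "$$": "b1"},
    3: {":=": "s5"},
    4: {"expr": "s6", **EXPR_START},
    5: {"expr": "s9", **EXPR_START},
    6: {**ADD, **reduce_on(FOLLOW_STMT, 6)},
    7: {**MUL, **reduce_on(FOLLOW_EXPR, 7)},
    8: {"expr": "s12", **EXPR_START},
    9: {**ADD, **reduce_on(FOLLOW_STMT, 4)},
    10: {**EXPR_START, "term": "s13"},
    11: {"factor": "b10", "id": "b12", "number": "b13", "(": "s8"},
    12: {**ADD, ")": "b11"},
    13: {**MUL, **reduce_on(FOLLOW_EXPR, 8)},
}

def parse(tokens):
    """Return the production numbers in the order they are recognized."""
    stack = [0]                  # state stack; the symbols are implied
    inp = list(tokens)           # reduced non-terminals are pushed back here
    reductions = []
    while True:
        if not inp:
            raise SyntaxError("unexpected end of input")
        sym = inp[0]
        if sym == "program":     # the start symbol has been recognized
            return reductions
        action = TABLE[stack[-1]].get(sym)
        if action is None:
            raise SyntaxError(f"unexpected {sym!r} in state {stack[-1]}")
        kind, num = action[0], int(action[1:])
        if kind == "s":
            stack.append(num)
            inp.pop(0)
            continue
        lhs, length = PRODS[num]
        if kind == "b":
            inp.pop(0)           # the shifted symbol is the last RHS symbol,
            length -= 1          # so one state fewer has to be popped
        if length:
            del stack[-length:]
        inp.insert(0, lhs)
        reductions.append(num)

print(parse(["write", "id", "+", "id", "$$"]))
```

Because a "shift and reduce" entry never pushes a state for the shifted symbol, the driver pops one state fewer than the figure's "pop n states" wording, which counts the state the shift would have pushed.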
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LR Parsing (11/11) 


= SLR parsing Is 
based on 
» Shift 
» Reduce 
and also 
» Shift & Reduce 


for 


optimization 


Parse stack                                        Input stream             Comment

0                                                  read A read B ...
0 read 1                                           A read B ...             shift read
0                                                  stmt read B ...          shift id(A) & reduce by stmt -> read id
0                                                  stmt_list read B ...     shift stmt & reduce by stmt_list -> stmt
0 stmt_list 2                                      read B sum ...           shift stmt_list
0 stmt_list 2 read 1                               B sum := ...             shift read
0 stmt_list 2                                      stmt sum := ...          shift id(B) & reduce by stmt -> read id
0                                                  stmt_list sum := ...     shift stmt & reduce by stmt_list -> stmt_list stmt
0 stmt_list 2                                      sum := A ...             shift stmt_list
0 stmt_list 2 id 3                                 := A + ...               shift id(sum)
0 stmt_list 2 id 3 := 5                            A + B ...                shift :=
0 stmt_list 2 id 3 := 5                            factor + B ...           shift id(A) & reduce by factor -> id
0 stmt_list 2 id 3 := 5                            term + B ...             shift factor & reduce by term -> factor
0 stmt_list 2 id 3 := 5 term 7                     + B write ...            shift term
0 stmt_list 2 id 3 := 5                            expr + B write ...       reduce by expr -> term
0 stmt_list 2 id 3 := 5 expr 9                     + B write ...            shift expr
0 stmt_list 2 id 3 := 5 expr 9                     add_op B write ...       shift + & reduce by add_op -> +
0 stmt_list 2 id 3 := 5 expr 9 add_op 10           B write sum ...          shift add_op
0 stmt_list 2 id 3 := 5 expr 9 add_op 10           factor write sum ...     shift id(B) & reduce by factor -> id
0 stmt_list 2 id 3 := 5 expr 9 add_op 10           term write sum ...       shift factor & reduce by term -> factor
0 stmt_list 2 id 3 := 5 expr 9 add_op 10 term 13   write sum ...            shift term
0 stmt_list 2 id 3 := 5                            expr write sum ...       reduce by expr -> expr add_op term
0 stmt_list 2 id 3 := 5 expr 9                     write sum ...            shift expr
0 stmt_list 2                                      stmt write sum ...       reduce by stmt -> id := expr
0                                                  stmt_list write sum ...  shift stmt & reduce by stmt_list -> stmt_list stmt
0 stmt_list 2                                      write sum ...            shift stmt_list
0 stmt_list 2 write 4                              sum write sum ...        shift write
0 stmt_list 2 write 4                              factor write sum ...     shift id(sum) & reduce by factor -> id
0 stmt_list 2 write 4                              term write sum ...       shift factor & reduce by term -> factor
0 stmt_list 2 write 4 term 7                       write sum ...            shift term
0 stmt_list 2 write 4                              expr write sum ...       reduce by expr -> term
0 stmt_list 2 write 4 expr 6                       write sum ...            shift expr
0 stmt_list 2                                      stmt write sum ...       reduce by stmt -> write expr
0                                                  stmt_list write sum ...  shift stmt & reduce by stmt_list -> stmt_list stmt
0 stmt_list 2                                      write sum / ...          shift stmt_list
0 stmt_list 2 write 4                              sum / 2 ...              shift write
0 stmt_list 2 write 4                              factor / 2 ...           shift id(sum) & reduce by factor -> id
0 stmt_list 2 write 4                              term / 2 ...             shift factor & reduce by term -> factor
0 stmt_list 2 write 4 term 7                       / 2 $$                   shift term
0 stmt_list 2 write 4 term 7                       mult_op 2 $$             shift / & reduce by mult_op -> /
0 stmt_list 2 write 4 term 7 mult_op 11            2 $$                     shift mult_op
0 stmt_list 2 write 4 term 7 mult_op 11            factor $$                shift number(2) & reduce by factor -> number
0 stmt_list 2 write 4                              term $$                  shift factor & reduce by term -> term mult_op factor
0 stmt_list 2 write 4 term 7                       $$                       shift term
0 stmt_list 2 write 4                              expr $$                  reduce by expr -> term
0 stmt_list 2 write 4 expr 6                       $$                       shift expr
0 stmt_list 2                                      stmt $$                  reduce by stmt -> write expr
0                                                  stmt_list $$             shift stmt & reduce by stmt_list -> stmt_list stmt
0 stmt_list 2                                      $$                       shift stmt_list
0                                                  program                  shift $$ & reduce by program -> stmt_list $$
[done]


Figure 2.29: Trace of a table-driven SLR(1) parse of the sum-and-average program.
States in the parse stack are shown in boldface type. Symbols in the parse stack
are for clarity only; they are not needed by the parsing algorithm. Parsing begins with the
initial state of the CFSM (State 0) in the stack. It ends when we reduce by
program -> stmt_list $$, uncovering State 0 again and pushing program onto the input stream.
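
The driver loop behind such a trace can be sketched as follows. To keep the example short and checkable it does not use the calculator table; it uses a hypothetical three-production toy grammar with a hand-built table, and the names `PROD_TAB`, `PARSE_TAB`, and `parse` are illustrative assumptions. As in the caption, a reduction pushes the left-hand side back onto the input stream, and parsing ends when the start symbol is seen with only the start state on the stack.

```python
from collections import deque

# Hypothetical toy grammar (NOT the calculator language):
#   1: program -> id_list $$
#   2: id_list -> id_list id
#   3: id_list -> id
# Auxiliary table: production -> (left-hand side, right-hand side length).
PROD_TAB = {1: ("program", 2), 2: ("id_list", 2), 3: ("id_list", 1)}

# Hand-built table: PARSE_TAB[state][symbol] -> (kind, number), where kind is
# "s" (shift), "r" (reduce) or "b" (shift, then reduce). Missing = error.
PARSE_TAB = {
    0: {"id": ("b", 3), "id_list": ("s", 1)},
    1: {"id": ("b", 2), "$$": ("b", 1)},
}


def parse(tokens, start_symbol="program", start_state=0):
    """Run the table-driven loop and return the list of actions taken."""
    stack = [(None, start_state)]   # (symbol, state) pairs; symbols are for clarity only
    stream = deque(tokens)          # input stream; reductions push the LHS onto its front
    trace = []
    while True:
        state = stack[-1][1]
        if not stream:
            raise SyntaxError("unexpected end of input")
        sym = stream[0]
        if state == start_state and sym == start_symbol:
            return trace            # success
        entry = PARSE_TAB[state].get(sym)
        if entry is None:
            raise SyntaxError(f"unexpected {sym!r} in state {state}")
        kind, number = entry
        if kind == "s":
            stream.popleft()
            stack.append((sym, number))
            trace.append(f"shift {sym}")
        elif kind == "r":
            lhs, rhs_len = PROD_TAB[number]
            del stack[len(stack) - rhs_len:]        # pop the whole right-hand side
            stream.appendleft(lhs)
            trace.append(f"reduce by {number}")
        else:  # "b": the last right-hand side symbol is never pushed
            lhs, rhs_len = PROD_TAB[number]
            stream.popleft()
            del stack[len(stack) - (rhs_len - 1):]  # pop the rest of the right-hand side
            stream.appendleft(lhs)
            trace.append(f"shift {sym} & reduce by {number}")


for step in parse(["id", "id", "$$"]):
    print(step)
```

The toy table happens to contain no plain reduce entries, so only the shift and shift-reduce branches are exercised by this input; the reduce branch is included to show how it differs (the lookahead stays in the input stream and the full right-hand side is popped).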
Conclusion
Assignments & Readings


= Readings
» Foreword/Preface, Chapters 1 and 2 (in particular, section 2.2.1)

= Assignment #1
» See Assignment #1 posted under "handouts" on the course Web site
» Due on June 12, 2014 by the beginning of class
Recommended Reference Books


= The books written by the creators of C++ and Java are the standard
references:
» Stroustrup. The C++ Programming Language, 3rd ed. (Addison-Wesley)
» Ken Arnold, James Gosling, and David Holmes. The Java(TM) Programming Language, 4th ed.
(Addison-Wesley)


= For the remaining languages, there is a lot of information available on
the web in the form of references and tutorials, so books may not be
strictly necessary, but a few recommended textbooks are as follows:

» John Barnes. Programming in Ada95, 2nd ed. (Addison Wesley)

» Lawrence C. Paulson. ML for the Working Programmer, 2nd ed. Cambridge University Press

» David Gelernter and Suresh Jagannathan: "Programming Linguistics", MIT Press, 1990

» Benjamin C. Pierce: "Types and Programming Languages", MIT Press, 2002

» Larry Wall, Tom Christiansen, and Jon Orwant: Programming Perl, 3rd ed. (O'Reilly)

» Giannesini et al.: "Prolog", Addison-Wesley, 1986

» Dewhurst & Stark: "Programming in C++", Prentice Hall, 1989

» Ada 95 Reference Manual, http://www.adahome.com/rm95/

» MIT Scheme Reference, http://www-swiss.ai.mit.edu/projects/scheme/documentation/scheme.html

» Strom et al.: "Hermes: A Language for Distributed Computing", Prentice-Hall, 1991

» R. Kent Dybvig: "The SCHEME Programming Language", Prentice Hall, 1987

» Jan Skansholm: "ADA 95 From the Beginning", Addison Wesley, 1997
Next Session: Imperative Languages - Names, Scoping, and Bindings